Detecting Complex Predicates In Hindi Using POS Projection Across Parallel Corpora

Authors

  • Amitabha Mukerjee
  • Ankit Soni
  • Achla M. Raina
Abstract

Complex Predicates (CPs) are multiword complexes functioning as single verbal units. CPs are particularly pervasive in Hindi and other Indo-Aryan languages, but a usage account driven by corpus-based identification of these constructs has not been possible, since single-language systems based on rules and statistical approaches require reliable tools (POS taggers, parsers, etc.) that are unavailable for Hindi. This paper highlights the development of the first such database, based on the simple idea of projecting POS tags across an English-Hindi parallel corpus. The CP types considered include adjective-verb (AV), noun-verb (NV), adverb-verb (Adv-V), and verb-verb (VV) composites. CPs are hypothesized where a verb in English is projected onto a multiword sequence in Hindi. While this process misses some CPs, those that are detected appear to be more reliable (83% precision, 46% recall). The resulting database lists usage instances of 1439 CPs in 4400 sentences.
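The projection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `hypothesize_cps`, the alignment format (English token index mapped to Hindi token indices), the use of Penn Treebank style `VB*` tags, and the contiguity requirement are all assumptions made for illustration. The example sentence pair is hand-built and transliterated.

```python
from typing import Dict, List, Tuple


def hypothesize_cps(
    en_tags: List[str],
    hi_tokens: List[str],
    alignment: Dict[int, List[int]],
) -> List[Tuple[int, Tuple[str, ...]]]:
    """Return (English verb index, Hindi token sequence) CP candidates.

    A candidate is hypothesized when one English verb is aligned to a
    contiguous sequence of two or more Hindi tokens (assumed heuristic).
    """
    candidates = []
    for en_idx in sorted(alignment):
        # Only English verbs can project onto a complex predicate.
        if not en_tags[en_idx].startswith("VB"):
            continue
        span = sorted(set(alignment[en_idx]))
        # A single aligned token is a simple verb, not a multiword unit.
        if len(span) < 2:
            continue
        # Assumption: keep only contiguous Hindi spans.
        if span[-1] - span[0] != len(span) - 1:
            continue
        candidates.append((en_idx, tuple(hi_tokens[i] for i in span)))
    return candidates


# Hypothetical aligned pair: "He helped me" / "usne meri madad ki".
en_tags = ["PRP", "VBD", "PRP"]
hi_tokens = ["usne", "meri", "madad", "ki"]
alignment = {0: [0], 1: [2, 3], 2: [1]}

result = hypothesize_cps(en_tags, hi_tokens, alignment)
print(result)  # [(1, ('madad', 'ki'))]
```

Here the English verb "helped" projects onto the two-token Hindi sequence "madad ki", which is flagged as a noun-verb candidate. A heuristic of this kind favors precision over recall, since any CP whose English counterpart is not a single aligned verb goes undetected.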

Similar resources

Aligning Sentences and Words Using English-hindi Bilingual Parallel Corpora

This dissertation project relates to language engineering issues. The Enabling Minority Language Engineering (EMILLE) project is a collaboration between the University of Sheffield and Lancaster University. It aims to develop a sixty-three-million-word electronic corpus of South Asian languages. As part of the EMILLE project, it was decided to develop a POS tagger for one of the languages...

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus

A complex predicate is a noun, a verb, an adjective, or an adverb followed by a light verb, with the combination behaving as a single verbal unit. Complex predicates (CPs) are used abundantly in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs is an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data....

Morphological Richness Offsets Resource Demand - Experiences in Constructing a POS Tagger for Hindi

In this paper we report our work on building a POS tagger for a morphologically rich language, Hindi. The theme of the research is to vindicate the stand that if morphology is strong and harnessable, then lack of training corpora is not debilitating. We establish a methodology of POS tagging which resource-disadvantaged languages (those lacking annotated corpora) can make use of. The methodology mak...

Complex Predicates in Indian Language Wordnets

Wordnets, which are repositories of lexical semantic knowledge containing semantically linked synsets and lexically linked words, are indispensable for work on computational linguistics and natural language processing. While building wordnets for Hindi and Marathi, two major Indo-European languages, we observed that the verb hierarchy in the Princeton Wordnet was rather shallow. We set to constr...

Creating Multilingual Parallel Corpora in Indian Languages

This paper describes the parallel corpora being created simultaneously in 12 major Indian languages, including English, under a nationally funded project named Indian Languages Corpora Initiative (ILCI), run through a consortium of institutions across India. The project runs in two phases. The first phase of the project has two distinct goals: creating parallel sentence aligned corp...

Publication date: 2006